In your final group assignment you have to analyse data about Airbnb listings and fit a model to predict the total cost for two people staying four nights in an Airbnb in a city. You can download Airbnb data from insideairbnb.com; it was originally scraped from airbnb.com.

1 Exploratory Data Analysis (EDA)

You may wish to have a level 1 header (#) for your EDA, then use level 2 sub-headers (##) to make sure you cover all three EDA bases. At a minimum you should address these questions:

  • How many variables/columns? How many rows/observations?
  • Which variables are numbers?
  • Which are categorical or factor variables (numeric or character variables that have a fixed and known set of possible values)?
  • What are the correlations between variables? Does each scatter plot support a linear relationship between variables? Do any of the correlations appear to be conditional on the value of a categorical variable?

At this stage, you may also find you want to use filter, mutate, arrange, select, or count. Let your questions lead you!
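
The correlation question can be approached with a pairwise correlation matrix. The following is a minimal sketch in base R; the data frame `toy` and its values are made up for illustration and are not taken from the listings data.

```r
# Hypothetical toy data standing in for a few numeric listing columns
toy <- data.frame(
  accommodates = c(2, 4, 2, 6, 3, 8),
  bedrooms     = c(1, 2, 1, 3, NA, 4),
  beds         = c(1, 2, 1, 4, 2, 5)
)

# Pairwise correlations; "pairwise.complete.obs" drops missing values
# pair by pair instead of discarding every row that contains an NA
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))
```

A scatter plot matrix of the same columns (for example with base R's `pairs(toy)`) is one way to check whether each correlation is backed by a roughly linear pattern.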

1.1 Examining the raw dataset

glimpse(listings)
Rows: 6,566
Columns: 74
$ id                                           <dbl> 958, 5858, 7918, 8142, 83…
$ listing_url                                  <chr> "https://www.airbnb.com/r…
$ scrape_id                                    <dbl> 2.021101e+13, 2.021101e+1…
$ last_scraped                                 <date> 2021-10-06, 2021-10-06, …
$ name                                         <chr> "Bright, Modern Garden Un…
$ description                                  <chr> "Please check local laws …
$ neighborhood_overview                        <chr> "Quiet cul de sac in frie…
$ picture_url                                  <chr> "https://a0.muscache.com/…
$ host_id                                      <dbl> 1169, 8904, 21994, 21994,…
$ host_url                                     <chr> "https://www.airbnb.com/u…
$ host_name                                    <chr> "Holly", "Philip And Tani…
$ host_since                                   <date> 2008-07-31, 2009-03-02, …
$ host_location                                <chr> "San Francisco, Californi…
$ host_about                                   <chr> "We are a family of four …
$ host_response_time                           <chr> "within an hour", "N/A", …
$ host_response_rate                           <chr> "100%", "N/A", "100%", "1…
$ host_acceptance_rate                         <chr> "92%", "N/A", "100%", "10…
$ host_is_superhost                            <lgl> TRUE, FALSE, FALSE, FALSE…
$ host_thumbnail_url                           <chr> "https://a0.muscache.com/…
$ host_picture_url                             <chr> "https://a0.muscache.com/…
$ host_neighbourhood                           <chr> "Duboce Triangle", "Berna…
$ host_listings_count                          <dbl> 1, 2, 10, 10, 2, 2, 1, 0,…
$ host_total_listings_count                    <dbl> 1, 2, 10, 10, 2, 2, 1, 0,…
$ host_verifications                           <chr> "['email', 'phone', 'face…
$ host_has_profile_pic                         <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ host_identity_verified                       <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ neighbourhood                                <chr> "San Francisco, Californi…
$ neighbourhood_cleansed                       <chr> "Western Addition", "Bern…
$ neighbourhood_group_cleansed                 <lgl> NA, NA, NA, NA, NA, NA, N…
$ latitude                                     <dbl> 37.77028, 37.74474, 37.76…
$ longitude                                    <dbl> -122.4332, -122.4209, -12…
$ property_type                                <chr> "Entire serviced apartmen…
$ room_type                                    <chr> "Entire home/apt", "Entir…
$ accommodates                                 <dbl> 3, 5, 2, 2, 4, 3, 4, 2, 3…
$ bathrooms                                    <lgl> NA, NA, NA, NA, NA, NA, N…
$ bathrooms_text                               <chr> "1 bath", "1 bath", "4 sh…
$ bedrooms                                     <dbl> 1, 2, 1, 1, 2, 1, 2, NA, …
$ beds                                         <dbl> 2, 3, 1, 1, 2, 1, 3, 1, 3…
$ amenities                                    <chr> "[\"Keypad\", \"Refrigera…
$ price                                        <chr> "$160.00", "$235.00", "$5…
$ minimum_nights                               <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_nights                               <dbl> 30, 60, 60, 90, 111, 14, …
$ minimum_minimum_nights                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_minimum_nights                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ minimum_maximum_nights                       <dbl> 1125, 60, 60, 90, 111, 14…
$ maximum_maximum_nights                       <dbl> 1125, 60, 60, 90, 111, 14…
$ minimum_nights_avg_ntm                       <dbl> 2, 30, 32, 32, 7, 13, 30,…
$ maximum_nights_avg_ntm                       <dbl> 1125, 60, 60, 90, 111, 14…
$ calendar_updated                             <lgl> NA, NA, NA, NA, NA, NA, N…
$ has_availability                             <lgl> TRUE, TRUE, TRUE, TRUE, T…
$ availability_30                              <dbl> 6, 30, 30, 11, 30, 23, 4,…
$ availability_60                              <dbl> 12, 60, 60, 41, 60, 47, 2…
$ availability_90                              <dbl> 18, 90, 90, 71, 90, 77, 5…
$ availability_365                             <dbl> 104, 365, 365, 346, 365, …
$ calendar_last_scraped                        <date> 2021-10-06, 2021-10-06, …
$ number_of_reviews                            <dbl> 302, 111, 19, 8, 28, 736,…
$ number_of_reviews_ltm                        <dbl> 40, 0, 0, 0, 0, 1, 2, 0, …
$ number_of_reviews_l30d                       <dbl> 5, 0, 0, 0, 0, 0, 0, 0, 0…
$ first_review                                 <date> 2014-10-05, 2009-11-24, …
$ last_review                                  <date> 2021-09-17, 2015-08-28, …
$ review_scores_rating                         <dbl> 4.87, 4.88, 4.20, 4.63, 4…
$ review_scores_accuracy                       <dbl> 4.94, 4.85, 3.73, 4.38, 4…
$ review_scores_cleanliness                    <dbl> 4.95, 4.87, 3.87, 4.38, 5…
$ review_scores_checkin                        <dbl> 4.96, 4.89, 4.67, 4.75, 4…
$ review_scores_communication                  <dbl> 4.90, 4.85, 4.60, 4.75, 5…
$ review_scores_location                       <dbl> 4.98, 4.77, 4.73, 4.63, 4…
$ review_scores_value                          <dbl> 4.78, 4.68, 4.00, 4.63, 4…
$ license                                      <chr> "City Registration Pendin…
$ instant_bookable                             <lgl> FALSE, FALSE, FALSE, FALS…
$ calculated_host_listings_count               <dbl> 1, 1, 9, 9, 2, 2, 1, 1, 2…
$ calculated_host_listings_count_entire_homes  <dbl> 1, 1, 0, 0, 2, 0, 1, 1, 2…
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 9, 9, 0, 2, 0, 0, 0…
$ calculated_host_listings_count_shared_rooms  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0…
$ reviews_per_month                            <dbl> 3.54, 0.77, 0.17, 0.10, 0…
skim(listings)
Data summary
Name listings
Number of rows 6566
Number of columns 74
_______________________
Column type frequency:
character 24
Date 5
logical 8
numeric 37
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
listing_url 0 1.00 32 37 0 6566 0
name 0 1.00 2 94 0 6190 0
description 75 0.99 14 1000 0 5850 0
neighborhood_overview 1777 0.73 9 1000 0 3642 0
picture_url 0 1.00 60 126 0 6268 0
host_url 0 1.00 38 43 0 3402 0
host_name 14 1.00 1 42 0 1850 0
host_location 20 1.00 2 62 0 236 0
host_about 1932 0.71 1 3409 0 2356 3
host_response_time 14 1.00 3 18 0 5 0
host_response_rate 14 1.00 2 4 0 48 0
host_acceptance_rate 14 1.00 2 4 0 92 0
host_thumbnail_url 14 1.00 55 106 0 3393 0
host_picture_url 14 1.00 57 109 0 3393 0
host_neighbourhood 418 0.94 3 31 0 162 0
host_verifications 0 1.00 4 152 0 248 0
neighbourhood 1777 0.73 28 54 0 6 0
neighbourhood_cleansed 0 1.00 6 21 0 36 0
property_type 0 1.00 4 35 0 52 0
room_type 0 1.00 10 15 0 4 0
bathrooms_text 10 1.00 6 17 0 30 0
amenities 0 1.00 27 1746 0 5585 0
price 0 1.00 5 10 0 593 0
license 2735 0.58 3 426 0 1648 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
last_scraped 0 1.00 2021-10-06 2021-10-06 2021-10-06 1
host_since 14 1.00 2008-07-31 2021-09-28 2015-02-02 2127
calendar_last_scraped 0 1.00 2021-10-06 2021-10-06 2021-10-06 1
first_review 1397 0.79 2009-09-25 2021-10-04 2018-12-26 2181
last_review 1397 0.79 2010-10-04 2021-10-05 2021-07-10 1090

Variable type: logical

skim_variable n_missing complete_rate mean count
host_is_superhost 14 1 0.44 FAL: 3670, TRU: 2882
host_has_profile_pic 14 1 0.99 TRU: 6490, FAL: 62
host_identity_verified 14 1 0.85 TRU: 5592, FAL: 960
neighbourhood_group_cleansed 6566 0 NaN :
bathrooms 6566 0 NaN :
calendar_updated 6566 0 NaN :
has_availability 0 1 0.99 TRU: 6487, FAL: 79
instant_bookable 0 1 0.36 FAL: 4205, TRU: 2361

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
id 0 1.00 2.714515e+07 16412399.00 9.580000e+02 1.289131e+07 2.836578e+07 4.156947e+07 5.263300e+07 ▇▅▆▇▇
scrape_id 0 1.00 2.021101e+13 0.00 2.021101e+13 2.021101e+13 2.021101e+13 2.021101e+13 2.021101e+13 ▁▁▇▁▁
host_id 0 1.00 8.383933e+07 109872537.10 1.169000e+03 4.562696e+06 2.648276e+07 1.225316e+08 4.250843e+08 ▇▂▁▁▁
host_listings_count 14 1.00 7.264000e+01 318.97 0.000000e+00 1.000000e+00 2.000000e+00 1.200000e+01 1.987000e+03 ▇▁▁▁▁
host_total_listings_count 14 1.00 7.264000e+01 318.97 0.000000e+00 1.000000e+00 2.000000e+00 1.200000e+01 1.987000e+03 ▇▁▁▁▁
latitude 0 1.00 3.777000e+01 0.02 3.771000e+01 3.775000e+01 3.777000e+01 3.779000e+01 3.781000e+01 ▂▃▆▇▅
longitude 0 1.00 -1.224300e+02 0.03 -1.225100e+02 -1.224400e+02 -1.224200e+02 -1.224100e+02 -1.223700e+02 ▁▂▅▇▁
accommodates 0 1.00 3.090000e+00 1.83 0.000000e+00 2.000000e+00 2.000000e+00 4.000000e+00 1.600000e+01 ▇▅▁▁▁
bedrooms 933 0.86 1.510000e+00 0.86 1.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 9.000000e+00 ▇▁▁▁▁
beds 66 0.99 1.720000e+00 1.22 0.000000e+00 1.000000e+00 1.000000e+00 2.000000e+00 1.400000e+01 ▇▂▁▁▁
minimum_nights 0 1.00 2.327000e+01 49.32 1.000000e+00 2.000000e+00 3.000000e+01 3.000000e+01 1.125000e+03 ▇▁▁▁▁
maximum_nights 0 1.00 4.935600e+02 541.98 1.000000e+00 2.900000e+01 1.800000e+02 1.125000e+03 1.000000e+04 ▇▁▁▁▁
minimum_minimum_nights 2 1.00 2.410000e+01 55.20 1.000000e+00 2.000000e+00 3.000000e+01 3.000000e+01 1.125000e+03 ▇▁▁▁▁
maximum_minimum_nights 2 1.00 3.962000e+01 116.87 1.000000e+00 2.000000e+00 3.000000e+01 3.000000e+01 1.125000e+03 ▇▁▁▁▁
minimum_maximum_nights 2 1.00 6.874500e+02 548.24 1.000000e+00 7.000000e+01 1.125000e+03 1.125000e+03 1.000000e+04 ▇▁▁▁▁
maximum_maximum_nights 2 1.00 7.525390e+06 126905437.12 1.000000e+00 9.000000e+01 1.125000e+03 1.125000e+03 2.147484e+09 ▇▁▁▁▁
minimum_nights_avg_ntm 2 1.00 3.896000e+01 113.93 1.000000e+00 2.000000e+00 3.000000e+01 3.000000e+01 1.125000e+03 ▇▁▁▁▁
maximum_nights_avg_ntm 2 1.00 7.508364e+06 126618320.95 1.000000e+00 9.000000e+01 1.125000e+03 1.125000e+03 2.142625e+09 ▇▁▁▁▁
availability_30 0 1.00 8.960000e+00 11.09 0.000000e+00 0.000000e+00 3.000000e+00 1.700000e+01 3.000000e+01 ▇▁▁▁▂
availability_60 0 1.00 2.278000e+01 22.77 0.000000e+00 0.000000e+00 1.800000e+01 4.300000e+01 6.000000e+01 ▇▂▂▂▃
availability_90 0 1.00 3.915000e+01 34.13 0.000000e+00 0.000000e+00 3.600000e+01 7.000000e+01 9.000000e+01 ▇▂▂▃▅
availability_365 0 1.00 1.606500e+02 134.11 0.000000e+00 2.200000e+01 1.420000e+02 3.000000e+02 3.650000e+02 ▇▃▂▂▆
number_of_reviews 0 1.00 4.422000e+01 84.37 0.000000e+00 1.000000e+00 7.000000e+00 4.600000e+01 8.610000e+02 ▇▁▁▁▁
number_of_reviews_ltm 0 1.00 6.140000e+00 15.27 0.000000e+00 0.000000e+00 1.000000e+00 4.000000e+00 3.730000e+02 ▇▁▁▁▁
number_of_reviews_l30d 0 1.00 7.200000e-01 2.08 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 5.100000e+01 ▇▁▁▁▁
review_scores_rating 1397 0.79 4.730000e+00 0.56 0.000000e+00 4.710000e+00 4.890000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_accuracy 1429 0.78 4.820000e+00 0.40 0.000000e+00 4.800000e+00 4.940000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_cleanliness 1429 0.78 4.760000e+00 0.43 0.000000e+00 4.710000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_checkin 1430 0.78 4.880000e+00 0.32 0.000000e+00 4.890000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_communication 1429 0.78 4.860000e+00 0.37 1.000000e+00 4.880000e+00 4.980000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_location 1430 0.78 4.800000e+00 0.39 0.000000e+00 4.770000e+00 4.910000e+00 5.000000e+00 5.000000e+00 ▁▁▁▁▇
review_scores_value 1430 0.78 4.660000e+00 0.45 0.000000e+00 4.580000e+00 4.760000e+00 4.900000e+00 5.000000e+00 ▁▁▁▁▇
calculated_host_listings_count 0 1.00 1.510000e+01 32.60 1.000000e+00 1.000000e+00 2.000000e+00 1.000000e+01 1.510000e+02 ▇▁▁▁▁
calculated_host_listings_count_entire_homes 0 1.00 1.071000e+01 31.45 0.000000e+00 0.000000e+00 1.000000e+00 2.000000e+00 1.510000e+02 ▇▁▁▁▁
calculated_host_listings_count_private_rooms 0 1.00 3.760000e+00 9.93 0.000000e+00 0.000000e+00 0.000000e+00 2.000000e+00 5.600000e+01 ▇▁▁▁▁
calculated_host_listings_count_shared_rooms 0 1.00 4.200000e-01 2.87 0.000000e+00 0.000000e+00 0.000000e+00 0.000000e+00 2.600000e+01 ▇▁▁▁▁
reviews_per_month 1397 0.79 1.940000e+00 5.20 1.000000e-02 2.200000e-01 6.900000e-01 2.100000e+00 1.260000e+02 ▇▁▁▁▁

Comment: The dataset for San Francisco’s Airbnb listings has 74 variables with 6,566 rows. Of these variables, 37 are numeric, 24 are character, 8 are logical, and 5 are date. Of course, not all of these variables are integral to the creation of a price model, and we need to be able to separate the signal from the noise by tidying the data.

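In our dataset, the variables best treated as categorical or ‘factor’ variables are those with a fixed and known set of possible values: the logical variables that take only the values TRUE or FALSE (five of the eight; neighbourhood_group_cleansed, bathrooms, and calendar_updated are entirely missing), and certain character variables such as host_neighbourhood and neighbourhood_cleansed. Numeric variables such as availability_X and review_scores_X are bounded (for example, availability_30 runs from 0 to 30 and the review scores from 0 to 5), but they remain numeric rather than categorical. Understanding which variables have a fixed and known set of possible values is important because it allows us to focus and restrict our analysis. For the bounded numeric variables, the narrow range also means we can plot them on linear instead of logarithmic scales.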
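
One way to shortlist candidate categorical variables is to count the distinct values in each column and convert the low-cardinality ones to factors. The sketch below uses a small made-up data frame (`toy` and all of its values are hypothetical); the same idea could be applied to `listings`.

```r
# Hypothetical toy data with one character, one logical, and one price-like column
toy <- data.frame(
  room_type         = c("Entire home/apt", "Private room", "Entire home/apt", "Shared room"),
  host_is_superhost = c(TRUE, FALSE, FALSE, TRUE),
  price             = c("$100.00", "$80.00", "$120.00", "$45.00"),
  stringsAsFactors  = FALSE
)

# Number of distinct values per column; small counts suggest a categorical variable
n_unique <- sapply(toy, function(col) length(unique(col)))
print(n_unique)

# Convert a low-cardinality character column to a factor
toy$room_type <- factor(toy$room_type)
print(levels(toy$room_type))
```

A column such as `price`, where almost every value is distinct, is a sign that the variable is really numeric and only stored as text.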

In particular, we have decided on some specific variables of interest to conduct our regression analysis. They are: accommodates, bedrooms, beds, number_of_reviews, review_scores_rating, review_scores_value, minimum_nights, and maximum_nights.

1.2 Summary Statistics for Variables of Interest

1.2.1 Number of people an Airbnb can accommodate

favstats(listings$accommodates)
 min Q1 median Q3 max     mean       sd    n missing
   0  2      2  4  16 3.094883 1.833757 6566       0

1.2.2 Number of bedrooms

favstats(listings$bedrooms)
 min Q1 median Q3 max     mean       sd    n missing
   1  1      1  2   9 1.514291 0.857692 5633     933

1.2.3 Number of beds

favstats(listings$beds)
 min Q1 median Q3 max     mean       sd    n missing
   0  1      1  2  14 1.723538 1.223217 6500      66

1.2.4 Number of reviews

favstats(listings$number_of_reviews)
 min Q1 median Q3 max     mean       sd    n missing
   0  1      7 46 861 44.22373 84.36606 6566       0

1.2.5 Rating of reviews

favstats(listings$review_scores_rating)
 min   Q1 median Q3 max    mean        sd    n missing
   0 4.71   4.89  5   5 4.73348 0.5575404 5169    1397

1.2.6 Review score for value

favstats(listings$review_scores_value)
 min   Q1 median  Q3 max    mean        sd    n missing
   0 4.58   4.76 4.9   5 4.66134 0.4450693 5136    1430

1.2.7 Minimum nights

favstats(listings$minimum_nights)
 min Q1 median Q3  max     mean       sd    n missing
   1  2     30 30 1125 23.26592 49.32259 6566       0

1.2.8 Maximum nights

favstats(listings$maximum_nights)
 min Q1 median   Q3   max     mean       sd    n missing
   1 29    180 1125 10000 493.5577 541.9797 6566       0

In all cases, please think about the message your plot is conveying. Don’t just say “This is my X-axis, this is my Y-axis”, but rather what’s the so what of the plot. Tell some sort of story and speculate about the differences in the patterns in no more than a paragraph.

Comment:

1.3 Data wrangling

listings_1 <- listings %>% 
  drop_na(c(host_is_superhost,host_has_profile_pic,host_identity_verified,instant_bookable,# logical variables
          bedrooms,beds,review_scores_rating,reviews_per_month,# numerical variables
          host_response_time,host_response_rate,host_acceptance_rate,bathrooms_text)) #char

# But we still have some "N/A"(not NA), so we need to drop them as well
na <- c(listings_1$host_response_time,listings_1$host_response_rate,listings_1$host_acceptance_rate)

n <- grep("N/A",na) # choose rows' index that contain "N/A"

listings_2 <- listings_1[-n,] %>% 
  mutate(price = parse_number(price),
         host_response_rate=parse_number(host_response_rate),
         host_acceptance_rate=parse_number(host_acceptance_rate))
         #host_is_superhost = factor(host_is_superhost, levels = c("TRUE","FALSE")),
         #host_identity_verified = factor(host_identity_verified, levels = c("TRUE","FALSE")))

listings_2         
# A tibble: 3,467 × 74
      id listing_url  scrape_id last_scraped name   description neighborhood_ov…
   <dbl> <chr>            <dbl> <date>       <chr>  <chr>       <chr>           
 1   958 https://www…   2.02e13 2021-10-06   Brigh… "Please ch… Quiet cul de sa…
 2  7918 https://www…   2.02e13 2021-10-06   A Fri… "Nice and … Shopping old to…
 3  8142 https://www…   2.02e13 2021-10-06   Frien… "Nice and … <NA>            
 4  8339 https://www…   2.02e13 2021-10-06   Histo… "Pls email… <NA>            
 5  8739 https://www…   2.02e13 2021-10-06   Missi… "Welcome t… Located between…
 6 10820 https://www…   2.02e13 2021-10-06   Haigh… "This prop… Neighborhood: H…
 7 10824 https://www…   2.02e13 2021-10-06   Victo… "This prop… Neighborhood: H…
 8 10832 https://www…   2.02e13 2021-10-06   Union… "This prop… Neighborhood: D…
 9 12041 https://www…   2.02e13 2021-10-06   Sunny… "Nice and … Small shopping …
10 12042 https://www…   2.02e13 2021-10-06   Sunny… "Settle do… <NA>            
# … with 3,457 more rows, and 67 more variables: picture_url <chr>,
#   host_id <dbl>, host_url <chr>, host_name <chr>, host_since <date>,
#   host_location <chr>, host_about <chr>, host_response_time <chr>,
#   host_response_rate <dbl>, host_acceptance_rate <dbl>,
#   host_is_superhost <lgl>, host_thumbnail_url <chr>, host_picture_url <chr>,
#   host_neighbourhood <chr>, host_listings_count <dbl>,
#   host_total_listings_count <dbl>, host_verifications <chr>, …
typeof(listings$price)
[1] "character"
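
One caveat with index-based removal: when several columns are concatenated with c() before calling grep(), the returned positions refer to the stacked vector, so a match in the second or third column gets a position larger than the number of rows. The sketch below illustrates this on a made-up base R data frame (`toy` and its values are hypothetical) and shows a row-wise alternative; it is an illustration under those assumptions, not a rerun of the cleaning step above.

```r
# Hypothetical toy data: row 2 has "N/A" everywhere, row 4 only in the last column
toy <- data.frame(
  host_response_time   = c("within an hour", "N/A", "within a day", "within an hour"),
  host_response_rate   = c("100%", "N/A", "90%", "95%"),
  host_acceptance_rate = c("92%", "N/A", "100%", "N/A"),
  stringsAsFactors     = FALSE
)

# Positions in the stacked vector of length 3 * nrow(toy), not row numbers
stacked <- c(toy$host_response_time, toy$host_response_rate, toy$host_acceptance_rate)
idx <- grep("N/A", stacked)
print(idx)

# Negative indices beyond nrow(toy) are ignored by a base data frame,
# so row 4 survives even though it contains "N/A"
by_index <- toy[-idx, ]

# Row-wise alternative: flag a row if any of the three columns equals "N/A"
rate_cols <- c("host_response_time", "host_response_rate", "host_acceptance_rate")
has_na_string <- rowSums(toy[rate_cols] == "N/A", na.rm = TRUE) > 0
by_row <- toy[!has_na_string, ]

print(nrow(by_index))
print(nrow(by_row))
```

With dplyr, a comparable row-wise filter could be written with `filter()` and one condition per column, which keeps the logic explicit.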

1.4 Visualization

density_price_plot <- ggplot(listings_2, aes(x=price))+
                      geom_density()+
                      theme_bw()+
                      labs(title = "Price Distribution",
                           subtitle = "Density Plot",
                           x = "Price",
                           y = "Density")+
                      NULL

density_price_plot

Comment: The density plot of price is heavily skewed to the right: most Airbnb prices in San Francisco cluster in a fairly narrow range at the low end of the scale, while a small number of very expensive listings stretch the tail out to the right. This makes sense, as most listings are ordinary short-term rentals, so their prices should not differ drastically from one another. In the next graph, we use a logarithmic scale to make the bulk of the distribution easier to read.

density_log_price <-  ggplot(listings_2, aes(x=price)) +
                      geom_density()+
                      theme_bw()+
                      scale_x_log10()+
                      labs(title = "Price Distribution",
                           subtitle = "Density Plot",
                           x = "Log (Price) ",
                           y = "Density")+
                      NULL
density_log_price

Comment: To better visualise this data, we can use a logarithmic scale, which makes the distribution look closer to a normal distribution, although it is still skewed to the right (positively skewed). The remaining skew arises because prices are bounded below, so few listings are priced close to 50 or cheaper, whereas a minority of listings are far more expensive than the typical one.
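
To put a number on how much a log transform reduces right skew, one option is to compare the sample skewness before and after the transform. The sketch below defines a small helper, `skewness()`, and applies it to a made-up price vector; both the helper name and the prices are assumptions for illustration, not values from the listings data.

```r
# Moment-based sample skewness: third central moment over sd cubed
skewness <- function(x) {
  centred <- x - mean(x)
  mean(centred^3) / mean(centred^2)^1.5
}

# Hypothetical nightly prices with one very expensive listing
prices <- c(50, 80, 100, 120, 150, 200, 300, 1000)

skew_raw <- skewness(prices)          # skewness on the original scale
skew_log <- skewness(log10(prices))   # skewness after a log10 transform

print(c(raw = skew_raw, log10 = skew_log))
```

A positive value indicates a longer right tail; for this toy vector the log-scale skewness is smaller than the raw one but still positive.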

availability_price <- ggplot(listings_2, aes(x = availability_30, y = log(price)))+
                      geom_col()+
                      theme_bw()+
  labs(title = "Availability for 30 days vs Price",
                           subtitle = "Bar Chart",
                           x = "Rooms available within 30 days",
                           y = "Log (Price)")+
                      NULL

availability_price
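
A point to keep in mind when reading this chart: geom_col() stacks all rows that share an x value, so each bar shows the sum of log(price) over the listings at that availability, not a typical price. The base R sketch below uses made-up numbers (`toy` is hypothetical) to show how a group of cheaper listings can still produce the taller bar.

```r
# Hypothetical toy data: four cheap listings at availability 0, two pricier ones at 30
toy <- data.frame(
  availability_30 = c(0, 0, 0, 0, 30, 30),
  price           = c(100, 100, 100, 100, 200, 200)
)
toy$log_price <- log(toy$price)

# What a stacked bar displays: the sum of log(price) per availability value
bar_height <- tapply(toy$log_price, toy$availability_30, sum)

# A typical price per availability value
median_price <- tapply(toy$price, toy$availability_30, median)

print(bar_height)
print(median_price)
```

To compare price levels directly in ggplot2, one possibility is a summary layer such as `stat_summary(fun = median, geom = "col")`, or a box plot grouped by availability.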

host_response_density <- ggplot(listings_2,aes(x=host_response_rate))+
                         geom_density()+
                         theme_bw()+
   labs(title = "Host Response Rate",
                           subtitle = "Density Plot",
                           x = "Response rate",
                           y = "Density")+
                         NULL

host_response_density

host_acceptance_density <- ggplot(listings_2,
                           aes(x=host_acceptance_rate))+
                           geom_density()+
                           theme_bw()+
   labs(title = "Host Acceptance Rate",
                           subtitle = "Density Plot",
                           x = "Host Acceptance Rate",
                           y = "Density")+
                           NULL

host_acceptance_density

number_of_reviews_density <- ggplot(listings_2,aes(x=number_of_reviews))+
                  geom_density()+
                  theme_bw()+
  labs(title = "Number of Reviews",
                           subtitle = "Density Plot",
                           x = "Number of reviews",
                           y = "Density")+
                  NULL

number_of_reviews_density

rating_density <- ggplot(listings_2,aes(x=review_scores_rating))+
                  geom_density()+
                  theme_bw()+
   labs(title = "Review Rating",
                           subtitle = "Density Plot",
                           x = "Rating",
                           y = "Density")+
                  NULL

rating_density

Comment: Availability for 30 days vs Price: The bars are tallest at the low end of the availability axis, with a second spike at 30 days. One reading is that listings with little remaining availability are priced higher, which would be in line with the basic economics of supply and demand: when the supply of rooms available at short notice is small, the people looking for them tend to have little other choice, so hosts can set prices higher. This reading needs care, however, because geom_col() stacks the log(price) of every listing that shares an availability value, so the height of each bar reflects how many listings fall at that value as well as how expensive they are. The spike at 30 may likewise reflect a large number of listings that are fully open for the coming month rather than higher prices.

Host Response Rate Density Plot: The density plot for host response rates shows that the vast majority of host response rates are between 90% and 100%. If they were much less than 90%, it is unlikely that anyone using Airbnb would think their properties to be reliable enough to pay for them.

Number of Reviews Density Plot: The number of reviews density plot is heavily skewed to the right. The x-axis extends past 750 because a handful of listings have several hundred reviews, whereas the typical listing has far fewer: in the full dataset the median is 7 reviews and three quarters of listings have 46 or fewer, against a maximum of 861. Writing reviews takes time, and most guests tend not to bother unless they feel strongly about their experience.

Review Rating Density Plot: Most ratings for the Airbnb properties are clustered around the 4.5-5 mark. If they were any less than this, most consumers would probably simply avoid them. Moreover, it is also possible that Airbnb has policies around taking off any listings below a certain number, maybe around 4, in the same way that Uber requires all of its drivers to be above a certain threshold to be able to continue driving for them.

superhost_price <- ggplot(listings_2, aes(x=log(price), y=host_is_superhost, fill= host_is_superhost))+
                   geom_boxplot()+
                   theme_bw()+
                   theme(legend.position = "none")+
                   labs(title = "Relationship between Superhost and Price",
                        subtitle = "Box Plot",
                        x = "Log(Price)",
                        y = "Superhost")+
                   NULL
superhost_price

superhost_price_density <- listings_2 %>%
                          ggplot(aes(x=log(price), color= host_is_superhost))+
                               geom_density()+
                               facet_wrap(~host_is_superhost)+
                               theme_bw()+
                          theme(legend.position = "none")+
  labs(title = "Relationship between Superhost and Price",
                           subtitle = "Density Plots",
                           x = "Log(Price)",
                           y = "Density")+
                           NULL
superhost_price_density

superhost_reviews <- ggplot(listings_2, aes(x=number_of_reviews, y=host_is_superhost, fill= host_is_superhost))+
                     geom_col()+
                     theme_bw()+
                     theme(legend.position = "none")+
  labs(title = "Relationship between Superhost and Reviews",
                           subtitle = "Bar Chart",
                           x = "Number of Reviews",
                           y = "Superhost")+
                           NULL
superhost_reviews

superhost_rating <- ggplot(listings_2, aes(x=review_scores_rating, y=host_is_superhost, fill= host_is_superhost))+
                    geom_col()+
                    theme_bw()+
                    theme(legend.position = "none")+
  labs(title = "Relationship between Superhost and Ratings",
                           subtitle = "Bar Chart",
                           x = "Ratings",
                           y = "Superhost")+
                           NULL
superhost_rating

Comment: Relationship between Superhost and Price: Somewhat surprisingly, the prices set by superhosts and non-superhosts are roughly similar. The distributions are also very similar, as shown by both the box plot and the density plot. This could suggest that superhost status does not have a significant impact on how much customers are willing to pay. However, when we look at the relationship between superhost status and reviews and ratings, being a superhost does appear to make a difference. One possible reason that superhosts and non-superhosts can still charge similar prices is that some customers are more sensitive to other factors, such as location.

Relationship between Superhost and Reviews: In total, superhosts accumulate many more reviews than non-superhosts, with more than 150,000 reviews summed across their listings compared to just over 50,000 for non-superhosts. If we look into Airbnb guidelines, this is consistent with the very high response rate required of Airbnb superhosts. Each superhost must maintain a response rate of 90% or higher, which may encourage more guests to leave reviews if they feel almost certain to receive a response.

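Relationship between Superhost and Ratings: The summed rating scores are also higher for superhosts than for non-superhosts. Because each bar adds up the ratings of every listing in the group, the gap reflects the number of listings in each group as well as how highly they are rated. To the extent that superhosts are rated more highly, this can be attributed to the closer relationship superhosts must strive to maintain with their guests in order to preserve their superhost designation.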

host_verified_price <- ggplot(listings_2, aes(x=log(price), y=host_identity_verified, fill= host_identity_verified))+
                       geom_boxplot()+
                       theme_bw()+
                       theme(legend.position = "none")+
   labs(title = "Relationship between a verified host and price",
                           subtitle = "Boxplot",
                           x = "Log(Price)",
                           y = "Verified Host")+
                           NULL
host_verified_price

host_verified_price_density <- ggplot(listings_2, aes(x=log(price), colour = host_identity_verified))+
                               geom_density()+
                               facet_wrap(~host_identity_verified)+
                               theme_bw()+
                               theme(legend.position = "none")+
  labs(title = "Relationship between a verified host and price ",
                           subtitle = "Density Plot",
                           x = "Log(Price)",
                           y = "Density")+
                           NULL
host_verified_price_density

host_verified_reviews <- ggplot(listings_2, aes(x=number_of_reviews, y=host_identity_verified, fill= host_identity_verified))+
                         geom_col()+
                         theme_bw()+
                         theme(legend.position = "none")+
  labs(title = "Relationship between a verified host and reviews ",
                           subtitle = "Bar Chart",
                           x = "Number of Reviews",
                           y = "Verified Host")+
                           NULL
host_verified_reviews

host_verified_rating <- ggplot(listings_2, aes(x=review_scores_rating, y=host_identity_verified, fill= host_identity_verified))+
                        geom_col()+
                        theme_bw()+ 
                        theme(legend.position = "none")+
   labs(title = "Relationship between a verified host and ratings",
                           subtitle = "Bar Chart",
                           x = "Ratings",
                           y = "Verified Host")+
                           NULL
host_verified_rating

Comment: Relationship between a verified host and price: Unlike the relationship between superhosts and price, there is a marked difference in the prices that verified and non-verified hosts can charge. This can be explained by the stricter requirements for becoming a verified host on Airbnb. To become a verified host, one needs to provide Airbnb with government ID, whereas Superhosts, in contrast, only need to meet requirements on minimum response rates, maximum cancellation rates, and minimum ratings.

Relationship between a verified host and reviews: Verified hosts receive a much higher volume of reviews than non-verified hosts. This may be attributed to the fact that some verified hosts also require their guests to be verified, in which case they may be avid Airbnb users and be more likely to contribute regular reviews to the properties they stay in.

Relationship between a verified host and ratings: Consistent with the relationship between a verified host and the number of reviews, the cumulative ratings for a verified host are much higher than for a non-verified host. The relationship between a verified host and ratings is very much comparable to that between a verified host and reviews.

1.5 Property types

Next, we look at the variable property_type. We can use the count function to determine how many categories there are and their frequency. What are the top 4 most common property types? What proportion of the total listings do they make up?

Since the vast majority of the observations in the data are one of the top four or five property types, we would like to create a simplified version of property_type variable that has 5 categories: the top four categories and Other. Fill in the code below to create prop_type_simplified.

listings_2 %>% 
  group_by(property_type) %>% 
  summarise(num_property_type = n()) %>% 
  arrange(desc(num_property_type))
# A tibble: 43 × 2
   property_type                       num_property_type
   <chr>                                           <int>
 1 Entire rental unit                                867
 2 Private room in residential home                  527
 3 Entire residential home                           486
 4 Entire condominium (condo)                        334
 5 Entire guest suite                                283
 6 Private room in rental unit                       246
 7 Room in boutique hotel                            151
 8 Private room in condominium (condo)                99
 9 Room in hotel                                      59
10 Entire serviced apartment                          52
# … with 33 more rows
listings_3 <- listings_2 %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit", "Private room in residential home","Entire residential home","Entire condominium (condo)") ~ property_type, 
    TRUE ~ "Other"))

Comment: The top 4 most common property types are, in order of most common to least common: entire rental unit, private room in residential home, entire residential home, and entire condominium (condo). These top 4 most common property types make up around 64% of all property listings.

listings_3 %>%
  count(prop_type_simplified) %>%
  arrange(desc(n)) 
# A tibble: 5 × 2
  prop_type_simplified                 n
  <chr>                            <int>
1 Other                             1253
2 Entire rental unit                 867
3 Private room in residential home   527
4 Entire residential home            486
5 Entire condominium (condo)         334
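
As a quick check of the 64% figure quoted in the comment above, the share of the top four property types can be recomputed from the counts in the table. This is a minimal sketch in base R; the vector name simplified_counts is our own and is not part of the original analysis.

```r
# Counts copied from the prop_type_simplified table above
simplified_counts <- c(
  "Other"                            = 1253,
  "Entire rental unit"               = 867,
  "Private room in residential home" = 527,
  "Entire residential home"          = 486,
  "Entire condominium (condo)"       = 334
)

# Share of listings that belong to the four named property types
top4_share <- sum(simplified_counts[names(simplified_counts) != "Other"]) /
  sum(simplified_counts)

round(top4_share, 2)
```

The four named types account for 2,214 of the 3,467 listings, which is where the "around 64%" comes from.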

1.5.1 Relationship between price and property type

log_price_prop <-  ggplot(listings_3, aes(x=log(price), color= prop_type_simplified)) +
                      geom_density()+
                      theme_bw()+
                      facet_wrap(~prop_type_simplified, nrow=1)+
                      labs(title = "Price distribution for different property types",
                           subtitle = "Density Plot",
                           x = "Log (Price) ",
                           y = "Density")+
                      theme(legend.position = "none")+
                      NULL
log_price_prop

Comment: The prices for the property types of entire rental unit and private room in residential home are especially closely packed together. This may be attributed to the fact that these 2 property types are the most popular and have the highest number of listings, and thus have a lot more properties available at relatively similar prices.

1.6 Correlation:

data_for_correlation <- listings_3 %>% 
  select(availability_30, bedrooms, beds, host_listings_count, number_of_reviews_l30d, review_scores_rating)

correlation_matrix <- cor(data_for_correlation)

ggcorrplot(correlation_matrix, hc.order = TRUE, lab = TRUE, colors = c("#CB454A", "white", "#7DCD85")) +
  labs(title = "Correlation Matrix")

ggpairs(data_for_correlation) +
  theme_bw() +
  labs(title = "Relationship between selected variables",
       subtitle = "Correlation with scatter and density plots")

Comment: The most noteworthy feature of the correlation matrix and the pairwise plots is the relatively high correlation between beds and bedrooms (0.756). This is expected: in general, the more bedrooms a property has, the more beds it has. Including both in a regression may introduce some multicollinearity, which inflates standard errors and makes inferences about the individual coefficients less reliable. The correlations between the other selected variables are all relatively low. The only other one worth noting is between availability_30 and review_scores_rating (-0.168). One possible explanation is that a property that remains available for longer faces weaker demand, and such properties may also be rated less highly. Nevertheless, this correlation is weak, so we should not worry about it too much.
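
cor() returns NA for any pair of variables that contains a missing value, so the correlation matrix above relies on the selected columns being complete. The sketch below uses a small made-up data frame (not the Airbnb data) to show how the use argument changes this behaviour.

```r
# Toy data with one missing value; the numbers are invented for illustration
toy <- data.frame(
  bedrooms = c(1, 2, 3, 4, NA, 2),
  beds     = c(1, 2, 4, 5, 3, 2)
)

# Default behaviour: any NA in a pair gives an NA correlation
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses the rows that are complete for that pair
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default["bedrooms", "beds"]
cor_pairwise["bedrooms", "beds"]
```

Pairwise deletion keeps as many rows as possible for each pair, at the cost that different cells of the matrix may be based on different subsets of listings.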

2 Mapping

Visualizations of feature distributions and their relations are key to understanding a data set, and they can open up new lines of exploration. While we do not have time to go into all the wonderful geospatial visualizations one can do with R, you can use the following code to start with a map of your city, and overlay all AirBnB coordinates to get an overview of the spatial distribution of AirBnB rentals. For this visualization we use the leaflet package, which includes a variety of tools for interactive maps, so you can easily zoom in-out, click on a point to get the actual AirBnB listing for that specific point, etc.

The following code, having downloaded a dataframe listings with all AirBnB listings in Milan, will plot on the map all AirBnBs where minimum_nights is less than or equal to four (4). You can learn more about leaflet by following the relevant Datacamp course on mapping with leaflet.

leaflet(data = filter(listings_3, minimum_nights <= 4)) %>% 
  addProviderTiles("OpenStreetMap.Mapnik") %>% 
  addCircleMarkers(lng = ~longitude, 
                   lat = ~latitude, 
                   radius = 1, 
                   fillColor = "blue", 
                   fillOpacity = 0.4, 
                   popup = ~listing_url,
                   label = ~property_type)

3 Regression Analysis

For the target variable \(Y\), we will use the cost for two people to stay at an Airbnb location for four (4) nights.

Create a new variable called price_4_nights that uses price and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.

Use histograms or density plots to examine the distributions of price_4_nights and log(price_4_nights). Which variable should you use for the regression model? Why?

Fit a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.

  • Interpret the coefficient review_scores_rating in terms of price_4_nights.
  • Interpret the coefficient of prop_type_simplified in terms of price_4_nights.

We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.

listings_model <- listings_3 %>% 
  filter(accommodates>=2, minimum_nights<=4) %>% 
  mutate(price_4_nights = price * 4) %>%  
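  # columns are selected by position (the duplicated index 36 is dropped); colnames() below lists the resulting names
  select(76,34,37,38,36,41,42,51,56,58,74,61,67,16,17,18,22,23,26,69,28,32,33,75)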

colnames(listings_model)
 [1] "price_4_nights"            "accommodates"             
 [3] "bedrooms"                  "beds"                     
 [5] "bathrooms_text"            "minimum_nights"           
 [7] "maximum_nights"            "availability_30"          
 [9] "number_of_reviews"         "number_of_reviews_l30d"   
[11] "reviews_per_month"         "review_scores_rating"     
[13] "review_scores_value"       "host_response_rate"       
[15] "host_acceptance_rate"      "host_is_superhost"        
[17] "host_listings_count"       "host_total_listings_count"
[19] "host_identity_verified"    "instant_bookable"         
[21] "neighbourhood_cleansed"    "property_type"            
[23] "room_type"                 "prop_type_simplified"     
favstats(listings_model$price_4_nights)
 min  Q1 median   Q3   max     mean       sd    n missing
 140 472    704 1180 1e+05 1106.583 2688.563 1685       0
listings_model_1 <- listings_model %>% 
  filter(price_4_nights <= 1106.58+2688.56)
ggplot(listings_model_1, aes(x = price_4_nights)) +
  geom_density() +
  labs(title = "Price distribution for four nights",
       subtitle = "Density Plot",
       x = "Price for 4 nights",
       y = "Density") +
  NULL

# the density plot without log is quite right-skewed, so we choose to log our Y.
ggplot(listings_model_1, aes(x = log(price_4_nights))) +
  geom_density() +
  labs(title = "Price distribution for four nights",
       subtitle = "Density Plot",
       x = "Log(Price for 4 nights)",
       y = "Density") +
  NULL

# it looks better now
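
The filter above hard-codes the cut-off as the mean plus one standard deviation taken from the favstats() output (1106.58 + 2688.56). A minimal sketch of computing the same kind of cut-off directly from the data is shown below; trim_upper is a hypothetical helper and the prices are invented, so this illustrates the idea rather than reproducing the original step.

```r
# Hypothetical helper: keep values at or below mean + k standard deviations
trim_upper <- function(x, k = 1) {
  cutoff <- mean(x, na.rm = TRUE) + k * sd(x, na.rm = TRUE)
  x[!is.na(x) & x <= cutoff]
}

# Invented prices with one extreme value, loosely mimicking the favstats() summary
prices <- c(140, 472, 704, 1180, 100000)
kept <- trim_upper(prices)
kept
```

Computing the threshold from the data avoids having to update the numbers by hand if the filtering steps before it change.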

3.1 Model 1

model1 <- lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating,data = listings_model_1)
summary(model1)

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating, data = listings_model_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.60967 -0.33077 -0.00579  0.31766  1.84554 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           6.7545780  0.1691724
prop_type_simplifiedEntire rental unit               -0.1905313  0.0559638
prop_type_simplifiedEntire residential home           0.0148465  0.0552276
prop_type_simplifiedOther                            -0.6576207  0.0498154
prop_type_simplifiedPrivate room in residential home -1.0489163  0.0547899
number_of_reviews                                    -0.0011476  0.0001068
review_scores_rating                                  0.0913308  0.0332170
                                                     t value Pr(>|t|)    
(Intercept)                                           39.927  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.405 0.000679 ***
prop_type_simplifiedEntire residential home            0.269 0.788099    
prop_type_simplifiedOther                            -13.201  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -19.144  < 2e-16 ***
number_of_reviews                                    -10.748  < 2e-16 ***
review_scores_rating                                   2.750 0.006034 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4936 on 1624 degrees of freedom
Multiple R-squared:  0.4186,    Adjusted R-squared:  0.4165 
F-statistic: 194.9 on 6 and 1624 DF,  p-value: < 2.2e-16
# When a categorical variable has k levels, we include (k-1) dummy variables in the regression model, and the level left out acts as our baseline (or zero). 
# In this case, "Entire condominium (condo)" is the baseline, and the intercept 6.75 is the expected log(price_4_nights) for a condo when number_of_reviews and review_scores_rating are both zero.
# The coefficient of Entire rental unit is -0.191: holding the other variables fixed, log(price_4_nights) for an entire rental unit is on average 0.191 lower than for the baseline type of condo, i.e. roughly 17% cheaper.
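
Because the response is log(price_4_nights), each coefficient is easier to read once converted to a percentage change in price. The sketch below applies the usual (exp(b) - 1) * 100 conversion to three coefficients copied from the summary(model1) output; the vector names are our own.

```r
# Coefficients copied from summary(model1) above
coefs <- c(
  entire_rental_unit   = -0.1905313,
  private_room_in_home = -1.0489163,
  review_scores_rating =  0.0913308
)

# In a log-linear model, a coefficient b corresponds to a (exp(b) - 1) * 100 percent change
pct_change <- (exp(coefs) - 1) * 100
round(pct_change, 1)
```

For the two property types the percentage is relative to the condo baseline; for review_scores_rating it is the change associated with a one-point increase in the rating, holding the other variables fixed.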

autoplot(model1)+ 
  theme_bw()

# to check the residuals
# there is a pattern in the top left graph, suggesting that some of the variation in Y is not yet accounted for by the variables in our model.

car::vif(model1)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.084174  4        1.010153
number_of_reviews    1.042724  1        1.021139
review_scores_rating 1.047398  1        1.023425

Comment: Looking at the Residuals vs Fitted graph, for fitted values from 5.5 to 6.5 the residuals bounce around the 0 line randomly and do not suggest any positive or negative correlation with the fitted values. This behaviour suggests that the errors have a mean of zero. The Normal Q-Q graph shows points lying close to a straight, upward-sloping line, which suggests that the residuals are approximately normally distributed, as the model assumes. This is probably helped by the natural logarithm applied to the response. Regarding the Scale-Location graph, the blue line is approximately horizontal, indicating that the average magnitude of the residuals does not change much as a function of the fitted values. Nevertheless, the spread around the blue line widens as fitted values increase up to 6.5; afterwards, the spread decreases and then increases again until 7.5. This behaviour may be an indication of heteroskedasticity. Finally, the Residuals vs Leverage graph shows that there are some observations that may have a disproportionate influence on the fitted model. Excluding these data points might therefore improve the model's predictive performance.
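
The influential observations mentioned above can also be identified numerically with Cook's distance, which is available in base R as cooks.distance(). The sketch below uses simulated data with one planted outlier (not the Airbnb listings), so the variable names and values are assumptions made for illustration.

```r
set.seed(1)

# Simulated data (not the Airbnb listings): 50 ordinary points plus one planted outlier
x <- c(rnorm(50), 10)
y <- 2 * x + rnorm(51, sd = 0.5)
y[51] <- -20  # far from the value of about 20 implied by the trend

fit <- lm(y ~ x)

# Cook's distance combines residual size and leverage for each observation
cooks <- cooks.distance(fit)
most_influential <- unname(which.max(cooks))
most_influential
```

Applied to a fitted model such as model1, the same call would return one value per listing used in the fit, which can then be sorted to inspect the most influential rows.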

3.2 Model 2

model2 <- lm(log(price_4_nights)~prop_type_simplified+number_of_reviews+review_scores_rating+room_type,data = listings_model_1)
summary(model2) # at the 5% significance level, every room_type coefficient is significant

Call:
lm(formula = log(price_4_nights) ~ prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = listings_model_1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.48015 -0.32051 -0.03474  0.28521  1.64810 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           6.8540134  0.1608274
prop_type_simplifiedEntire rental unit               -0.1989411  0.0519599
prop_type_simplifiedEntire residential home           0.0080909  0.0512741
prop_type_simplifiedOther                            -0.4877513  0.0499914
prop_type_simplifiedPrivate room in residential home -0.7698072  0.0621849
number_of_reviews                                    -0.0010125  0.0001002
review_scores_rating                                  0.0701141  0.0316186
room_typeHotel room                                   0.1618588  0.0824503
room_typePrivate room                                -0.2930125  0.0356993
room_typeShared room                                 -1.3441640  0.0923743
                                                     t value Pr(>|t|)    
(Intercept)                                           42.617  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.829 0.000134 ***
prop_type_simplifiedEntire residential home            0.158 0.874637    
prop_type_simplifiedOther                             -9.757  < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -12.379  < 2e-16 ***
number_of_reviews                                    -10.104  < 2e-16 ***
review_scores_rating                                   2.217 0.026728 *  
room_typeHotel room                                    1.963 0.049804 *  
room_typePrivate room                                 -8.208 4.54e-16 ***
room_typeShared room                                 -14.551  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4583 on 1621 degrees of freedom
Multiple R-squared:  0.4999,    Adjusted R-squared:  0.4971 
F-statistic:   180 on 9 and 1621 DF,  p-value: < 2.2e-16
autoplot(model2)+
  theme_bw()

car::vif(model2)
                         GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 2.501720  4        1.121450
number_of_reviews    1.065645  1        1.032301
review_scores_rating 1.101116  1        1.049341
room_type            2.536817  3        1.167835

Comment: As in Model 1, the residuals bounce around the 0 line, with a spread that widens as fitted values increase. The graph therefore suggests that the errors have a mean of zero, although their variance may not be constant. In the Normal Q-Q plot some data points fall below the reference line, but few lie clearly above it. Therefore, as in Model 1, the residuals behave approximately as expected under the normality assumption. Regarding the Scale-Location graph, the blue line is approximately horizontal but slopes slightly upwards, indicating that the magnitude of the standardized residuals grows as fitted values increase. In addition, the spread around the line also increases with the fitted values, suggesting the presence of heteroskedasticity. Finally, the Residuals vs Leverage graph shows some data points that may influence the fitted model. Excluding those data points might therefore make the model more robust.
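
The question of whether room_type is a significant predictor given everything else in the model concerns all of its levels jointly, which the individual t-tests do not answer directly. One common way to test this is a partial F-test comparing the nested models with anova(). The sketch below uses simulated data and illustrative variable names (not the Airbnb listings); with the real models, the analogous comparison would be between model1 and model2.

```r
set.seed(42)
n <- 300

# Simulated data (not the Airbnb listings); variable names are illustrative only
rating    <- runif(n, 3, 5)
room_type <- factor(sample(c("Entire home/apt", "Private room", "Shared room"),
                           n, replace = TRUE))
room_effect <- c("Entire home/apt" = 0, "Private room" = -0.3, "Shared room" = -1.3)
log_price <- 6 + 0.1 * rating +
  unname(room_effect[as.character(room_type)]) + rnorm(n, sd = 0.4)

small <- lm(log_price ~ rating)
large <- lm(log_price ~ rating + room_type)

# Partial F-test: does adding room_type (all levels jointly) improve the fit?
comparison <- anova(small, large)
p_value <- comparison[["Pr(>F)"]][2]
p_value
```

A small p-value indicates that the larger model fits significantly better, i.e. that the added factor matters once the other predictors are accounted for.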

3.3 Further variables/questions to explore on our own

Our dataset has many more variables, so here are some ideas on how you can extend your analysis

3.3.1 Model 3

  1. Are the number of bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?
listings_model_2 <- listings_model_1 %>% 
  mutate(bathrooms_num = case_when(bathrooms_text=="1 shared bath"~1,
                                   bathrooms_text=="3 baths"~3,
                                   bathrooms_text=="1 private bath"~1,
                                   bathrooms_text=="1 bath"~1,
                                   bathrooms_text=="1.5 shared baths"~1.5,
                                   bathrooms_text=="2.5 shared baths"~2.5,
                                   bathrooms_text=="2 baths"~2,
                                   bathrooms_text=="1.5 baths"~1.5,
                                   bathrooms_text=="2.5 baths"~2.5,
                                   bathrooms_text=="0 baths"~0,
                                   bathrooms_text=="2 shared baths"~2,
                                   bathrooms_text=="4 baths"~4,
                                   bathrooms_text=="3 shared baths"~3,
                                   bathrooms_text=="Half-bath"~0.5,
                                   bathrooms_text=="Shared half-bath"~0.5,
                                   bathrooms_text=="private half-bath"~0.5,
                                   bathrooms_text=="3.5 baths"~3.5,
                                   bathrooms_text=="3.5 shared baths"~3.5,
                                   bathrooms_text=="5 baths"~5,
                                   bathrooms_text=="4.5 baths"~4.5,
                                   bathrooms_text=="4 shared baths"~4,
                                   bathrooms_text=="5 shared baths"~5,
                                   bathrooms_text=="5.5 baths"~5.5,
                                   bathrooms_text=="8 shared baths"~8))
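
The recoding above lists every bathrooms_text label by hand, so any label that is not listed silently becomes NA. A shorter alternative is to extract the leading number and treat the half-bath labels separately. This is a minimal base R sketch; parse_bathrooms is a hypothetical helper and was not used to produce the results in this report.

```r
# Hypothetical helper, not part of the original analysis
parse_bathrooms <- function(x) {
  # Take the leading number, e.g. "1.5 shared baths" -> 1.5; text without one becomes NA
  num <- suppressWarnings(as.numeric(sub("^([0-9.]+).*$", "\\1", x)))
  # Labels such as "Half-bath" or "Shared half-bath" carry no digit and count as 0.5
  ifelse(grepl("half-bath", x, ignore.case = TRUE), 0.5, num)
}

parsed <- parse_bathrooms(c("1 shared bath", "2.5 baths", "Half-bath",
                            "Shared half-bath", "0 baths", NA))
parsed
```

Matching "half-bath" case-insensitively also covers variations in capitalisation between labels.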

model3 <- lm(log(price_4_nights) ~ bathrooms_num + bedrooms + beds + accommodates,data = listings_model_2)
summary(model3)

Call:
lm(formula = log(price_4_nights) ~ bathrooms_num + bedrooms + 
    beds + accommodates, data = listings_model_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.67221 -0.34389 -0.01925  0.34023  1.80524 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    5.78725    0.03241 178.577  < 2e-16 ***
bathrooms_num  0.03775    0.02703   1.397    0.163    
bedrooms       0.42280    0.02940  14.379  < 2e-16 ***
beds          -0.10790    0.01610  -6.702 2.83e-11 ***
accommodates   0.09773    0.01357   7.204 8.96e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.493 on 1601 degrees of freedom
  (25 observations deleted due to missingness)
Multiple R-squared:  0.3919,    Adjusted R-squared:  0.3904 
F-statistic:   258 on 4 and 1601 DF,  p-value: < 2.2e-16
car::vif(model3)
bathrooms_num      bedrooms          beds  accommodates 
     1.514122      3.356921      2.829508      4.061990 

Comment: According to our model, the number of bathrooms is not a significant predictor of price (p = 0.163). This may be because the number of bathrooms is not a differentiator for a customer when deciding to book an Airbnb. However, the number of bedrooms, the number of beds, and the size of the property (accommodates) may influence the experience of the customer and, therefore, the final price of the Airbnb. Our model suggests that those 3 variables are significant. Finally, bedrooms and beds are a clear example of collinearity: most of the time, having more bedrooms in a flat/apartment means having more beds (correlation of 0.76). The VIFs are all below 5, so the collinearity is moderate, but including both variables in our model is unlikely to add much value.
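
For a numeric predictor, the variance inflation factor reported by car::vif() corresponds to 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes this by hand on simulated data (not the Airbnb listings); vif_by_hand is a hypothetical helper introduced only for illustration.

```r
set.seed(7)
n <- 500

# Simulated predictors (not the Airbnb listings): beds tracks bedrooms closely
bedrooms  <- sample(1:4, n, replace = TRUE)
beds      <- bedrooms + sample(0:1, n, replace = TRUE)
bathrooms <- sample(1:3, n, replace = TRUE)

# VIF for one predictor: regress it on the other predictors and use 1 / (1 - R^2)
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_beds      <- vif_by_hand(beds, data.frame(bedrooms, bathrooms))
vif_bathrooms <- vif_by_hand(bathrooms, data.frame(bedrooms, beds))
c(beds = vif_beds, bathrooms = vif_bathrooms)
```

A predictor that is largely explained by the others gets a high VIF, while one that is unrelated to them stays close to 1.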

3.3.2 Model 4

  1. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
model4 <- lm(log(price_4_nights) ~ host_is_superhost,data = listings_model_2)
summary(model4)

Call:
lm(formula = log(price_4_nights) ~ host_is_superhost, data = listings_model_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.64268 -0.43792 -0.03324  0.43828  1.64589 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)            6.58432    0.02675 246.188   <2e-16 ***
host_is_superhostTRUE -0.01954    0.03338  -0.585    0.558    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6463 on 1629 degrees of freedom
Multiple R-squared:  0.0002103, Adjusted R-squared:  -0.0004034 
F-statistic: 0.3427 on 1 and 1629 DF,  p-value: 0.5584

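Comment: Model 4 contains host_is_superhost as its only explanatory variable, so it does not yet control for other variables. On its own, superhost status is not a significant predictor of log(price_4_nights) (coefficient -0.020, p = 0.558), and the model explains almost none of the variation in price (R-squared of 0.0002). This may be because customers do not consider this characteristic essential and focus more on other aspects of the flat/apartment, so landlords do not increase prices because they are superhosts. The question of a premium after controlling for other variables is addressed by Model 9, which includes host_is_superhost alongside the other predictors; there the coefficient is positive (0.060) and significant at the 5% level (p = 0.012).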
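
Whether a coefficient is estimated with or without controls can change both its size and its sign. The sketch below illustrates this on simulated data (not the Airbnb listings): the true superhost premium is assumed to be 0.06 on the log scale, and superhosts are assumed to list cheaper private rooms more often. All names and numbers are assumptions made for the illustration.

```r
set.seed(123)
n <- 5000

# Simulated data (not the Airbnb listings). Assumed true superhost premium: 0.06 on the log scale
private_room <- rbinom(n, 1, 0.5)
superhost    <- rbinom(n, 1, 0.3 + 0.4 * private_room)  # superhosts more often list private rooms
log_price    <- 6.5 - 0.5 * private_room + 0.06 * superhost + rnorm(n, sd = 0.4)

# Without controls the superhost coefficient also absorbs the private-room discount
naive      <- lm(log_price ~ superhost)
controlled <- lm(log_price ~ superhost + private_room)

c(naive = coef(naive)[["superhost"]], controlled = coef(controlled)[["superhost"]])
```

In this simulation the single-predictor model suggests a discount for superhosts, while the model that controls for room type recovers a premium close to the assumed value.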

3.3.3 Model 5

  1. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model5 <- lm(log(price_4_nights) ~ instant_bookable,data = listings_model_2)
summary(model5)

Call:
lm(formula = log(price_4_nights) ~ instant_bookable, data = listings_model_2)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.71706 -0.44077 -0.03789  0.40320  1.64781 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           6.68687    0.02034  328.82   <2e-16 ***
instant_bookableTRUE -0.28144    0.03180   -8.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6314 on 1629 degrees of freedom
Multiple R-squared:  0.04588,   Adjusted R-squared:  0.04529 
F-statistic: 78.33 on 1 and 1629 DF,  p-value: < 2.2e-16

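Comment: Model 5 contains instant_bookable as its only explanatory variable, so it does not yet control for other variables. On its own, instant_bookable is a significant predictor (p < 2e-16): the coefficient of -0.281 means that instantly bookable listings are roughly 25% cheaper on average. The negative coefficient may suggest that landlords want their flats/apartments booked as soon as possible and therefore price them at a discount relative to non-instantly bookable flats. On the other hand, customers may find it convenient to be able to book an Airbnb immediately (a feature common for hotel bookings). In Model 9, which includes the other predictors, the coefficient remains negative and significant (-0.104).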

3.3.4 Model 6

  1. For all cities, there are 3 variables that relate to neighborhoods: neighbourhood, neighbourhood_cleansed, and neighbourhood_group_cleansed. There are typically more than 20 neighborhoods in each city, and it wouldn’t make sense to include them all in your model. Use your city knowledge, or ask someone with city knowledge, and see whether you can group neighborhoods together so the majority of listings falls in fewer (5-6 max) geographical areas. You would thus need to create a new categorical variable neighbourhood_simplified and determine whether location is a predictor of price_4_nights
#unique(listings$neighbourhood_cleansed,incomparables=FALSE)

listings_model_3 <- listings_model_2 %>% 
                            mutate(neighbourhood_simplified = ifelse(neighbourhood_cleansed %in% c(
                                        "Financial District", 
                                        "Presidio Heights",
                                        "Seacliff",
                                        "Haight-Ashbury",
                                        "Nob Hill",
                                        "Diamond Heights",
                                        "West of Twin Peaks",
                                        "Russian Hill",
                                        "Noe Valley",
                                        "Golden Gate Park"), "Prime","Other"))

model6 <- lm(price_4_nights ~ neighbourhood_simplified, data = listings_model_3)
summary(model6)

Call:
lm(formula = price_4_nights ~ neighbourhood_simplified, data = listings_model_3)

Residuals:
   Min     1Q Median     3Q    Max 
-799.3 -423.5 -191.7  228.3 2808.3 

Coefficients:
                              Estimate Std. Error t value Pr(>|t|)    
(Intercept)                     871.65      16.93  51.499   <2e-16 ***
neighbourhood_simplifiedPrime    79.61      43.15   1.845   0.0652 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 628.8 on 1629 degrees of freedom
Multiple R-squared:  0.002086,  Adjusted R-squared:  0.001473 
F-statistic: 3.405 on 1 and 1629 DF,  p-value: 0.06518

Comment: There are 36 neighbourhoods in San Francisco. Using our city knowledge, we identify 10 good neighbourhoods and categorize them as “Prime”, while all other neighbourhoods fall in the category “Other”. We then use this simplified neighbourhood variable in Model 6 to explore the effect of location on the price for 4 nights. Note that, unlike the other models, Model 6 is fitted on price_4_nights itself rather than on its logarithm. In this model, the intercept is 871.65 with a standard error of 16.93. The neighbourhood variable has a coefficient of 79.61 with a standard error of 43.15, meaning that a listing in one of the “Prime” neighbourhoods costs on average 79.61 more for 4 nights than a listing elsewhere, which matches our expectation. While the intercept is highly significant (t-value of 51.499), the coefficient is significant only at the 10% level, not at the 5% level (t-value = 1.845 < 2, p = 0.0652).
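
The p-value reported for neighbourhood_simplifiedPrime can be reproduced from its t-value and the residual degrees of freedom, which makes it easy to see on which side of the usual thresholds it falls. A minimal sketch using the values from the summary(model6) output:

```r
# Values copied from summary(model6) above
t_value  <- 1.845
df_resid <- 1629

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)
p_value

# Critical value for a two-sided test at the 5% level
critical_5pct <- qt(0.975, df = df_resid)
critical_5pct
```

The t-value is below the 5% critical value, which is consistent with a p-value between 0.05 and 0.10.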

3.3.5 Model 7

  1. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
model7 <- lm(log(price_4_nights) ~ availability_30,data = listings_model_3)
summary(model7)

Call:
lm(formula = log(price_4_nights) ~ availability_30, data = listings_model_3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.66035 -0.43728 -0.05451  0.44741  1.76889 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)      6.70058    0.02327 287.960  < 2e-16 ***
availability_30 -0.01232    0.00164  -7.514 9.43e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6355 on 1629 degrees of freedom
Multiple R-squared:  0.0335,    Adjusted R-squared:  0.0329 
F-statistic: 56.46 on 1 and 1629 DF,  p-value: 9.429e-14

Comment: In Model 7, we explore the effect of availability on the log price for 4 nights. In this model, the intercept is 6.70058 with a standard error of 0.02327. The availability variable has a coefficient of -0.01232 with a standard error of 0.00164, meaning that a one-unit increase in availability_30 is associated with a decrease of 0.01232 in the log price for 4 nights (about 1.2% lower). The rationale here is that high availability indicates that the property is less popular and thus has a lower price. Both the intercept and the coefficient are highly significant.

Based on our analysis in this model, we conclude that availability_30 is a significant predictor on its own. Since Model 7 contains no other explanatory variables, the effect after controlling for other variables has to be read from Model 9, where the coefficient is still significant but positive (0.0044).

3.3.6 Model 8

model8 <- lm(log(price_4_nights) ~ reviews_per_month,data = listings_model_3)
summary(model8)

Call:
lm(formula = log(price_4_nights) ~ reviews_per_month, data = listings_model_3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.64488 -0.43373 -0.03517  0.44219  1.61507 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        6.604684   0.017747 372.158  < 2e-16 ***
reviews_per_month -0.009082   0.002165  -4.195 2.87e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6429 on 1629 degrees of freedom
Multiple R-squared:  0.01069,   Adjusted R-squared:  0.01008 
F-statistic:  17.6 on 1 and 1629 DF,  p-value: 2.872e-05

Comment: In this model, we analyse the effect of reviews per month on the log price for 4 nights. The intercept is 6.604 with a standard error of 0.017. Its t-value suggests that the null hypothesis of a zero intercept can be easily rejected at any conventional level of significance. The intercept means that when reviews_per_month takes the value 0, the average log price for 4 nights is equal to 6.604.

The variable reviews_per_month has a coefficient of -0.009 with a standard error of 0.0021. Its t-value is also significant and indicates that each additional review per month is associated with a change of -0.009 in the log price for 4 nights.

3.4 Model 9

# model9 combines the variables we explored before
model9 <- lm(log(price_4_nights) ~ availability_30 + reviews_per_month + neighbourhood_simplified +
               instant_bookable + host_is_superhost + bedrooms + beds + bathrooms_num +
               prop_type_simplified + number_of_reviews + review_scores_rating +
               number_of_reviews_l30d + room_type,
             data = listings_model_3)
summary(model9)

Call:
lm(formula = log(price_4_nights) ~ availability_30 + reviews_per_month + 
    neighbourhood_simplified + instant_bookable + host_is_superhost + 
    bedrooms + beds + bathrooms_num + prop_type_simplified + 
    number_of_reviews + review_scores_rating + number_of_reviews_l30d + 
    room_type, data = listings_model_3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.33287 -0.26347 -0.03088  0.25644  1.44196 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           6.149e+00  1.514e-01
availability_30                                       4.444e-03  1.202e-03
reviews_per_month                                     3.755e-05  1.561e-03
neighbourhood_simplifiedPrime                         1.282e-01  2.876e-02
instant_bookableTRUE                                 -1.043e-01  2.220e-02
host_is_superhostTRUE                                 6.024e-02  2.396e-02
bedrooms                                              2.763e-01  2.523e-02
beds                                                  1.407e-02  1.385e-02
bathrooms_num                                         6.439e-02  2.253e-02
prop_type_simplifiedEntire rental unit               -1.571e-01  4.595e-02
prop_type_simplifiedEntire residential home          -1.827e-01  4.651e-02
prop_type_simplifiedOther                            -3.257e-01  4.514e-02
prop_type_simplifiedPrivate room in residential home -6.247e-01  5.615e-02
number_of_reviews                                    -6.055e-04  9.692e-05
review_scores_rating                                  7.519e-02  2.934e-02
number_of_reviews_l30d                               -2.051e-02  3.695e-03
room_typeHotel room                                   3.395e-01  7.611e-02
room_typePrivate room                                -2.112e-01  3.405e-02
room_typeShared room                                 -1.500e+00  1.635e-01
                                                     t value Pr(>|t|)    
(Intercept)                                           40.612  < 2e-16 ***
availability_30                                        3.697 0.000226 ***
reviews_per_month                                      0.024 0.980810    
neighbourhood_simplifiedPrime                          4.457 8.91e-06 ***
instant_bookableTRUE                                  -4.700 2.83e-06 ***
host_is_superhostTRUE                                  2.514 0.012045 *  
bedrooms                                              10.951  < 2e-16 ***
beds                                                   1.016 0.309979    
bathrooms_num                                          2.857 0.004327 ** 
prop_type_simplifiedEntire rental unit                -3.419 0.000645 ***
prop_type_simplifiedEntire residential home           -3.928 8.93e-05 ***
prop_type_simplifiedOther                             -7.216 8.27e-13 ***
prop_type_simplifiedPrivate room in residential home -11.125  < 2e-16 ***
number_of_reviews                                     -6.247 5.36e-10 ***
review_scores_rating                                   2.563 0.010464 *  
number_of_reviews_l30d                                -5.552 3.31e-08 ***
room_typeHotel room                                    4.460 8.76e-06 ***
room_typePrivate room                                 -6.203 7.05e-10 ***
room_typeShared room                                  -9.179  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4046 on 1587 degrees of freedom
  (25 observations deleted due to missingness)
Multiple R-squared:  0.5941,    Adjusted R-squared:  0.5895 
F-statistic:   129 on 18 and 1587 DF,  p-value: < 2.2e-16
autoplot(model9)+
  theme_bw()

car::vif(model9)
                             GVIF Df GVIF^(1/(2*Df))
availability_30          1.272198  1        1.127917
reviews_per_month        1.312397  1        1.145599
neighbourhood_simplified 1.059505  1        1.029322
instant_bookable         1.172699  1        1.082913
host_is_superhost        1.302781  1        1.141395
bedrooms                 3.669330  1        1.915550
beds                     3.111778  1        1.764023
bathrooms_num            1.562141  1        1.249856
prop_type_simplified     3.456565  4        1.167698
number_of_reviews        1.257991  1        1.121602
review_scores_rating     1.213977  1        1.101806
number_of_reviews_l30d   1.457407  1        1.207231
room_type                4.709159  3        1.294664

Comment: After regressing log(price_4_nights) on the combined set of predictors (Model 9), we can assess the fit using the diagnostic plots. The residuals vs fitted plot has the characteristics of an appropriate model: the residuals scatter around the 0 line, suggesting that the linear specification is reasonable, and the overall band is roughly horizontal, which indicates that the error variance is roughly constant. Note that reviews_per_month and beds are not significant in this model.

The normal Q-Q plot follows the reference line of a normal distribution closely and shows no obvious outliers. In the scale-location plot the blue line is roughly horizontal, which indicates homoscedasticity: the spread of the residuals is roughly equal at all fitted values. The residuals are scattered randomly around the blue line, although slightly more lie above it than below, so it is reasonable to assume similar variability across the fitted values. In the residuals vs leverage plot, the spread of the residuals appears to narrow as leverage increases. Since high-leverage observations are typically few, this is at most weak evidence of heteroscedasticity; the main purpose of this plot is to identify influential observations. The residuals all lie close to the blue line, indicating that no individual observation has a large impact on the model.
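
The visual reading of the residuals vs leverage plot can be cross-checked numerically. The sketch below uses base R's hatvalues() and cooks.distance() on a stand-in model fitted to the built-in mtcars data (an assumption made so the example runs on its own; model9 would take the place of toy_model). The 4/n cut-off for Cook's distance is a common rule of thumb, not a formal test.

```r
# Stand-in model on a built-in dataset; replace toy_model with model9
toy_model <- lm(log(mpg) ~ wt + hp + factor(cyl), data = mtcars)

n <- nobs(toy_model)
influence_tbl <- data.frame(
  leverage  = hatvalues(toy_model),       # diagonal of the hat matrix
  cooks_d   = cooks.distance(toy_model),  # influence of each observation on the fit
  std_resid = rstandard(toy_model)        # standardised residuals
)

# Rule-of-thumb flag: Cook's distance above 4 / n
influence_tbl$flagged <- influence_tbl$cooks_d > 4 / n

# Observations worth a closer look, most influential first
flagged <- influence_tbl[influence_tbl$flagged, ]
flagged[order(-flagged$cooks_d), ]
```

An empty result would support the reading that no single listing drives the fit; a few flagged rows would be candidates for inspection rather than automatic removal.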

3.5 Model 10

# the model we chose:

model10 <- lm(log(price_4_nights) ~ availability_30 + neighbourhood_simplified +
                instant_bookable + bedrooms + prop_type_simplified +
                number_of_reviews + review_scores_rating + room_type,
              data = listings_model_3)
summary(model10)

Call:
lm(formula = log(price_4_nights) ~ availability_30 + neighbourhood_simplified + 
    instant_bookable + bedrooms + prop_type_simplified + number_of_reviews + 
    review_scores_rating + room_type, data = listings_model_3)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.40124 -0.27617 -0.03625  0.25856  1.50340 

Coefficients:
                                                       Estimate Std. Error
(Intercept)                                           6.200e+00  1.514e-01
availability_30                                       3.419e-03  1.187e-03
neighbourhood_simplifiedPrime                         1.334e-01  2.846e-02
instant_bookableTRUE                                 -1.263e-01  2.181e-02
bedrooms                                              3.200e-01  1.696e-02
prop_type_simplifiedEntire rental unit               -1.624e-01  4.640e-02
prop_type_simplifiedEntire residential home          -1.805e-01  4.689e-02
prop_type_simplifiedOther                            -3.405e-01  4.538e-02
prop_type_simplifiedPrivate room in residential home -6.323e-01  5.591e-02
number_of_reviews                                    -7.539e-04  9.036e-05
review_scores_rating                                  8.025e-02  2.876e-02
room_typeHotel room                                   3.282e-01  7.500e-02
room_typePrivate room                                -1.993e-01  3.330e-02
room_typeShared room                                 -1.367e+00  8.561e-02
                                                     t value Pr(>|t|)    
(Intercept)                                           40.964  < 2e-16 ***
availability_30                                        2.881 0.004016 ** 
neighbourhood_simplifiedPrime                          4.686 3.02e-06 ***
instant_bookableTRUE                                  -5.791 8.39e-09 ***
bedrooms                                              18.868  < 2e-16 ***
prop_type_simplifiedEntire rental unit                -3.499 0.000479 ***
prop_type_simplifiedEntire residential home           -3.849 0.000123 ***
prop_type_simplifiedOther                             -7.504 1.02e-13 ***
prop_type_simplifiedPrivate room in residential home -11.309  < 2e-16 ***
number_of_reviews                                     -8.343  < 2e-16 ***
review_scores_rating                                   2.790 0.005331 ** 
room_typeHotel room                                    4.376 1.29e-05 ***
room_typePrivate room                                 -5.984 2.68e-09 ***
room_typeShared room                                 -15.969  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4087 on 1617 degrees of freedom
Multiple R-squared:  0.6031,    Adjusted R-squared:  0.5999 
F-statistic:   189 on 13 and 1617 DF,  p-value: < 2.2e-16
autoplot(model10)+
  theme_bw()

car::vif(model10)
                             GVIF Df GVIF^(1/(2*Df))
availability_30          1.265461  1        1.124927
neighbourhood_simplified 1.029556  1        1.014670
instant_bookable         1.122926  1        1.059682
bedrooms                 1.634221  1        1.278367
prop_type_simplified     3.303017  4        1.161085
number_of_reviews        1.089254  1        1.043673
review_scores_rating     1.145368  1        1.070219
room_type                3.002912  3        1.201131
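
For a single numeric predictor, the variance inflation factor reported by car::vif() equals 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. A minimal base-R sketch on the built-in mtcars data (a stand-in chosen so the example is self-contained, not the listings data):

```r
# Predictors of a hypothetical model mpg ~ wt + hp + disp
predictors <- mtcars[, c("wt", "hp", "disp")]

# Auxiliary regression: one predictor on the remaining ones
aux <- lm(wt ~ hp + disp, data = predictors)
vif_wt <- 1 / (1 - summary(aux)$r.squared)

# Equivalent route: diagonal of the inverse correlation matrix of the predictors
vif_all <- diag(solve(cor(predictors)))

vif_wt
vif_all
```

For factors with several levels, car::vif() reports a generalised VIF instead, and GVIF^(1/(2*Df)) is the column to compare across terms with different degrees of freedom. In the Model 10 output above, every value in that column is below 1.3.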

Comment: Model 10 presents almost identical results for the first three plots discussed for Model 9. The exception is the residuals vs leverage plot. Here, leverage has little visible effect on the spread of almost all the residuals. A few points at the far end of the blue line could be outliers or influential observations affecting the fit. Overall, this plot points to a handful of potentially influential observations rather than to a clear pattern of heteroskedasticity.

3.6 Diagnostics, collinearity, summary tables

  1. Create a summary table, using huxtable, that shows which models you worked on, which predictors are significant, the adjusted \(R^2\), and the Residual Standard Error.
Comparison of models (standard errors in parentheses)

                                                      Model 1         Model 2         Model 3         Combined Model  Final Model
(Intercept)                                           6.755           6.854           5.787           6.149           6.200
                                                      (0.169)         (0.161)         (0.032)         (0.151)         (0.151)
prop_type_simplifiedEntire rental unit                -0.191          -0.199                          -0.157          -0.162
                                                      (0.056)         (0.052)                         (0.046)         (0.046)
prop_type_simplifiedEntire residential home           0.015           0.008                           -0.183          -0.180
                                                      (0.055)         (0.051)                         (0.047)         (0.047)
prop_type_simplifiedOther                             -0.658          -0.488                          -0.326          -0.341
                                                      (0.050)         (0.050)                         (0.045)         (0.045)
prop_type_simplifiedPrivate room in residential home  -1.049          -0.770                          -0.625          -0.632
                                                      (0.055)         (0.062)                         (0.056)         (0.056)
number_of_reviews                                     -0.001          -0.001                          -0.001          -0.001
                                                      (0.000)         (0.000)                         (0.000)         (0.000)
review_scores_rating                                  0.091           0.070                           0.075           0.080
                                                      (0.033)         (0.032)                         (0.029)         (0.029)
room_typeHotel room                                                   0.162                           0.339           0.328
                                                                      (0.082)                         (0.076)         (0.075)
room_typePrivate room                                                 -0.293                          -0.211          -0.199
                                                                      (0.036)                         (0.034)         (0.033)
room_typeShared room                                                  -1.344                          -1.500          -1.367
                                                                      (0.092)                         (0.163)         (0.086)
bathrooms_num                                                                         0.038           0.064
                                                                                      (0.027)         (0.023)
bedrooms                                                                              0.423           0.276           0.320
                                                                                      (0.029)         (0.025)         (0.017)
beds                                                                                  -0.108          0.014
                                                                                      (0.016)         (0.014)
accommodates                                                                          0.098
                                                                                      (0.014)
availability_30                                                                                       0.004           0.003
                                                                                                      (0.001)         (0.001)
reviews_per_month                                                                                     0.000
                                                                                                      (0.002)
neighbourhood_simplifiedPrime                                                                         0.128           0.133
                                                                                                      (0.029)         (0.028)
instant_bookableTRUE                                                                                  -0.104          -0.126
                                                                                                      (0.022)         (0.022)
host_is_superhostTRUE                                                                                 0.060
                                                                                                      (0.024)
number_of_reviews_l30d                                                                                -0.021
                                                                                                      (0.004)
Number of observations                                1631            1631            1606            1606            1631
Adj. R Squared                                        0.416           0.497           0.390           0.590           0.600
Residual SE                                           0.494           0.458           0.493           0.405           0.409
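
A table of this shape can be produced with huxtable::huxreg(). The sketch below is a hedged illustration rather than the code behind the table above: the two models are stand-ins fitted to the built-in mtcars data, the column labels are placeholders, and it assumes the huxtable and broom packages are installed.

```r
library(huxtable)

# Stand-in models; in the report these would be the fitted listings models
m_small <- lm(log(mpg) ~ wt, data = mtcars)
m_large <- lm(log(mpg) ~ wt + hp + factor(cyl), data = mtcars)

comparison <- huxreg(
  "Small model" = m_small,
  "Large model" = m_large,
  # Names on the left are row labels; values are broom::glance() columns
  statistics = c(
    "Number of observations" = "nobs",
    "Adj. R Squared"         = "adj.r.squared",
    "Residual SE"            = "sigma"
  ),
  stars = NULL  # no significance stars; standard errors stay in parentheses
)
comparison <- set_caption(comparison, "Comparison of models")
comparison
```

Leaving stars at its default instead of NULL would mark significant predictors directly in the table, which is one way to address the "which predictors are significant" part of the task.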
  2. Finally, you must use the best model you came up with for prediction. Suppose you are planning to visit the city you have been assigned to over reading week, and you want to stay in an Airbnb. Find Airbnbs in your destination city that are apartments with a private room, have at least 10 reviews, and have an average rating of at least 90. Use your best model to predict the total cost to stay at this Airbnb for 4 nights. Include the appropriate 95% interval with your prediction. Report the point prediction and interval in terms of price_4_nights.
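data_for_predict <- listings_model_3 %>% 
  filter(room_type == "Private room",
         # reviews received in the last 30 days (not the total review count)
         number_of_reviews_l30d >= 10,
         # review_scores_rating appears to be on a 0-5 scale here, so this
         # threshold excludes very few rated listings
         review_scores_rating >= 0.9)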


#data_for_predict <- data.frame(availability_30=30, neighbourhood_simplified ="Prime" , instant_bookable=TRUE, bedrooms=1 , prop_type_simplified="Private room in residential home" ,  number_of_reviews=10 , review_scores_rating=4 , room_type="Private room") 

data_for_predict
# A tibble: 15 × 26
   price_4_nights accommodates bedrooms  beds bathrooms_text minimum_nights
            <dbl>        <dbl>    <dbl> <dbl> <chr>                   <dbl>
 1            392            2        1     1 1 private bath              1
 2            260            2        1     1 1 private bath              1
 3            280            4        1     1 1 private bath              1
 4            300            4        1     2 1 private bath              1
 5            520            2        1     1 1 private bath              1
 6            248            3        1     1 1 shared bath               1
 7            228            2        1     1 1 shared bath               1
 8            364            2        1     1 1 private bath              1
 9            376            2        1     1 1 shared bath               2
10            344            2        1     1 1 private bath              1
11            372            2        1     1 1 shared bath               2
12            612            2        1     1 1 private bath              1
13            328            2        1     1 1 shared bath               1
14            552            2        1     1 1 private bath              1
15            336            2        1     1 5 shared baths              2
# … with 20 more variables: maximum_nights <dbl>, availability_30 <dbl>,
#   number_of_reviews <dbl>, number_of_reviews_l30d <dbl>,
#   reviews_per_month <dbl>, review_scores_rating <dbl>,
#   review_scores_value <dbl>, host_response_rate <dbl>,
#   host_acceptance_rate <dbl>, host_is_superhost <lgl>,
#   host_listings_count <dbl>, host_total_listings_count <dbl>,
#   host_identity_verified <lgl>, instant_bookable <lgl>, …
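
The filter above selects on number_of_reviews_l30d and a rating threshold of 0.9. A sketch that follows the brief more literally is given below, under two assumptions that should be checked against the data: that number_of_reviews holds the total review count, and that review_scores_rating is on a 0 to 5 scale, so a rating of 90 out of 100 corresponds to 4.5. The toy tibble is made up purely so the example runs.

```r
library(dplyr)

# Made-up listings, for illustration only
toy_listings <- tibble(
  room_type            = c("Private room", "Private room", "Entire home/apt", "Private room"),
  number_of_reviews    = c(25, 4, 120, 60),
  review_scores_rating = c(4.8, 4.9, 4.7, 4.2)
)

rating_threshold <- 4.5  # assumed equivalent of 90/100 on a 0-5 scale

candidates <- toy_listings %>%
  filter(room_type == "Private room",
         number_of_reviews >= 10,                   # total reviews, not last 30 days
         review_scores_rating >= rating_threshold)

candidates
```

Only the first toy listing passes all three conditions. Applied to listings_model_3, a filter like this would return a different set of rows from the 15 shown above.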
# When we plug this multi-row data frame into predict(), it'll generate a
# prediction for each row

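# interval = "confidence" gives the 95% interval for the mean log price at
# these predictor values; exp() back-transforms to the price scale
model_prediction <- data.frame(predict(model10, newdata = data_for_predict, interval = "confidence")) %>% 
  mutate(Price = exp(fit),
         CI_lower = exp(lwr),
         CI_upper = exp(upr)) %>% 
  select(Price, CI_lower, CI_upper)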

model_prediction
      Price CI_lower CI_upper
1  321.6923 287.8543 359.5080
2  295.7143 269.9815 323.8997
3  337.0595 318.3466 356.8724
4  332.2504 312.6587 353.0697
5  359.5702 338.2829 382.1972
6  357.3532 338.6215 377.1211
7  401.1296 380.4416 422.9426
8  447.1335 423.1091 472.5222
9  367.4627 345.5347 390.7823
10 444.6029 408.1545 484.3061
11 525.9696 498.0300 555.4766
12 529.3585 500.1927 560.2250
13 462.7024 435.9064 491.1456
14 517.2307 488.5524 547.5924
15 549.0524 518.0488 581.9115
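
predict() offers two kinds of interval for a linear model: "confidence" describes the uncertainty in the mean response at the given predictor values, while "prediction" also includes the residual variation of an individual observation and is therefore wider. Since the task is about the cost of one particular stay, the prediction interval is arguably the more relevant one. The sketch below compares the two on a stand-in model fitted to the built-in mtcars data; the new observation is made up.

```r
toy_model <- lm(log(mpg) ~ wt + hp, data = mtcars)
new_car <- data.frame(wt = 3, hp = 120)  # made-up inputs

ci_mean <- predict(toy_model, newdata = new_car,
                   interval = "confidence", level = 0.95)
pi_new  <- predict(toy_model, newdata = new_car,
                   interval = "prediction", level = 0.95)

# Back-transform from the log scale to the original scale
intervals <- rbind(confidence = exp(ci_mean[1, ]),
                   prediction = exp(pi_new[1, ]))
intervals
```

Both rows share the same point estimate; only the bounds differ. With model10, switching interval = "confidence" to interval = "prediction" would widen every interval in the table above.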

Comment: The figures quoted here come from the single hypothetical listing defined in the commented-out data frame above, not from the 15 filtered listings. The inputs are a private room in a 1-bedroom residential home in a prime neighbourhood, instantly bookable, with availability_30 = 30, 10 reviews and an average rating of 4. For this listing the model predicts a price of 451.07 for 4 nights, with a 95% confidence interval of 410.36 to 495.81.
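
As a check on the point estimate quoted in the comment, the prediction can be rebuilt by hand from the Model 10 coefficients and the inputs in the commented-out data frame. The coefficients below are copied from the printed summary, so the result agrees with the reported 451.07 only up to rounding. The short coefficient names are labels chosen for this sketch.

```r
# Model 10 coefficients as printed in summary(model10)
b <- c(intercept            = 6.200,
       availability_30      = 3.419e-03,
       prime                = 1.334e-01,
       instant_bookable     = -1.263e-01,
       bedrooms             = 3.200e-01,
       private_room_in_home = -6.323e-01,
       number_of_reviews    = -7.539e-04,
       review_scores_rating = 8.025e-02,
       room_type_private    = -1.993e-01)

# Inputs in the same order; dummy variables are 1 when the level applies
x <- c(1, 30, 1, 1, 1, 1, 10, 4, 1)

log_price <- sum(b * x)   # fitted value on the log scale
price <- exp(log_price)   # back-transformed to price_4_nights, roughly 451
price
```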

4 Deliverables

  • By midnight on Monday 18 Oct 2021, you must upload on Canvas a short presentation (max 4-5 slides) with your findings, as some groups will be asked to present in class. You should present your Exploratory Data Analysis, as well as your best model. In addition, you must upload on Canvas your final report, written using R Markdown to introduce, frame, and describe your story and findings. You should include the following in the memo:
  1. Executive Summary: Based on your best model, indicate the factors that influence price_4_nights. This should be written for an intelligent but non-technical audience. All other sections can include technical writing.
  2. Data Exploration and Feature Selection: Present key elements of the data, including tables and graphs that help the reader understand the important variables in the dataset. Describe how the data was cleaned and prepared, including feature selection, transformations, interactions, and other approaches you considered.
  3. Model Selection and Validation: Describe the model fitting and validation process used. State the model you selected and why it is preferable to other choices.
  4. Findings and Recommendations: Interpret the results of the selected model and discuss additional steps that might improve the analysis.

Remember to follow R Markdown etiquette rules and style: don’t have the Rmd output extraneous messages or warnings, include summary tables in nice tables (use kableExtra), and remove any placeholder texts from past Rmd templates; in other words, I don’t want to see stuff I wrote in your final report.

5 Acknowledgements